Multimodal Image Understanding for Explainable Anomaly Detection

MUXAD

basic research project

January 2025 - December 2027

Collaborating partners

University of Ljubljana, Faculty of Computer and Information Science

Funding

ARIS (J2-60055)

Researchers

Vitjan Zavrtanik, PhD

Project overview

Driven by rapid advances in artificial intelligence, particularly in computer vision and natural language processing, deep learning has enabled impressive performance across many tasks. However, fundamental challenges remain concerning how deeply AI systems understand their inputs and how well they can explain their decisions. This project addresses these issues by focusing on anomaly detection in images through multimodal models that not only detect whether and where something is anomalous but also understand and explain why.

The core objective is to integrate visual and linguistic information to tackle three key challenges in contemporary AI: semantic image understanding, multimodal image understanding, and multimodal explanations. The first research challenge, Semantic Image Understanding, targets the limitations of current anomaly detection methods by enhancing models’ ability to recognize complex logical and structural anomalies beyond surface-level defects. The second challenge, Multimodal Image Understanding, develops zero-shot anomaly detection approaches in which vision-language models are applied to object classes they have not previously been exposed to, supplemented by textual descriptions of anomalies at both the task and instance levels. The third challenge, Multimodal Explanations, focuses on enriching visual anomaly explanations with textual descriptions, making the models more intuitive and transparent.

In the first project period, work concentrated on WP1 and WP4 and established the initial research line for WP2. The strongest outputs so far include SALAD, published at ICCV 2025, and the complementary journal paper No Label Left Behind in the Journal of Intelligent Manufacturing. Supporting results cover multimodal reasoning with large language models, difficulty assessment for anomaly-detection benchmarks with DIAD, and data-efficient few-shot detection with PyramidCore.

MUXAD aims to advance anomaly detection by using multimodal AI to build models that are not only accurate but also interpretable and explainable, a significant step toward transparent AI systems.

The expected contributions of the project are:

Enhanced semantic image understanding for detecting complex and logical anomalies beyond surface defects.
Development of zero-shot multimodal anomaly detection methods that combine visual and linguistic data without prior exposure to specific classes.
Creation of multimodal explanation techniques that combine visual anomaly localization with rich textual descriptions.
Application of the developed methods to manufacturing visual inspection and medical imaging interpretation.

Workpackages

Development of advanced methods for semantic image understanding aimed at detecting complex anomalies, including SALAD, the current flagship ICCV 2025 result of the project (WP1).
Creation of multimodal image understanding approaches integrating vision and language for zero-shot anomaly detection, including AnomalyVFM as an emerging high-visibility research line, along with supporting workshop- and conference-style outputs such as the paper Detekcija logičnih anomalij z uporabo velikih jezikovnih modelov (English: Logical Anomaly Detection Using Large Language Models) (WP2).
Development of methods for generating multimodal explanations combining visual and textual descriptions of anomalies (WP3).
Application of the developed methods to real-world use cases in manufacturing visual inspection and medical imaging interpretation (WP4).

Project phases:

Year 1: Focus on local and global appearance learning, object composition learning, and dataset curation (WP1).
Year 2: Activities on zero-shot anomaly detection, text-based knowledge injection, text-based weakly labelled supervision, and manufacturing visual inspection (WP2, WP4).
Year 3: Focus on text-driven explanations and modeling uncertainty in vision-language models (WP3).

Software and method resources

Project-related software and datasets:

Publications

AnomalyVFM - Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Matic Fučka, Vitjan Zavrtanik and Danijel Skočaj

IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR), 2026

ObjectCore - Efficient Few-shot Logical Anomaly Detection using Object Representations

Matic Fučka, Vitjan Zavrtanik and Danijel Skočaj

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026

PyramidCore -- Feature Pyramids for Few-Shot Logical Anomaly Detection

Matic Fučka, Vitjan Zavrtanik and Danijel Skočaj

2026 IEEE 23rd Mediterranean Electrotechnical Conference (MELECON), 2026

Detekcija logičnih anomalij z uporabo velikih jezikovnih modelov (English: Logical Anomaly Detection Using Large Language Models)

Matic Fučka and Danijel Skočaj

ERK 2025, 2025

Introducing DIAD: A Novel Metric for Assessing the Difficulty of Anomaly Detection Problems

Jure Pahor and Danijel Skočaj

ERK 2025, 2025

No Label Left Behind: A Unified Surface Defect Detection Model for all Supervision Regimes

Blaž Rolih, Matic Fučka and Danijel Skočaj

Journal of Intelligent Manufacturing, 2025

SALAD -- Semantics-Aware Logical Anomaly Detection

Matic Fučka, Vitjan Zavrtanik and Danijel Skočaj

IEEE/CVF International Conference on Computer Vision (ICCV), 2025

SuperSimpleNet: Unifying Unsupervised and Supervised Learning for Fast and Reliable Surface Defect Detection

Blaž Rolih, Matic Fučka and Danijel Skočaj

Pattern Recognition: 27th International Conference, ICPR 2024, Springer, 2024
